5  Step 3 - Check Species Validity

6 Problem description

Databases often include a list of species taxa. Because most of these lists are filled in by hand, the taxa names are expected to contain typos and inconsistent notations for unidentified species. It is therefore necessary to check and correct the taxa names using services built for that purpose.

Since we are checking crossings recorded by camera traps, it is normal that some taxa could not be identified to species level. This makes the task harder, as we have to consider the lowest taxonomic level that could be determined. We therefore need solutions that can validate the several ways in which the species field is filled in.

7 Problem solving

7.1 Common steps

To solve this issue, we repeat some of the basic first steps from the previous checks: we use our customized read_sheet function, which provides the full paths of all available .xlsx files, to read the species sheet from every file.

Code
source("R/FUNCTIONS.R")
Code
spreadsheets <- read_sheet(path = "Excel", results = FALSE)

sp_full <- purrr::map(.x = spreadsheets, function(arquivo){
  readxl::read_excel(
    arquivo,
    sheet = 6,
    na = c("NA", "na"),
    col_types = c("guess", "guess", "guess", "date", "guess", "guess", "guess"),
    col_names = TRUE
  )
})
Warning: Expecting date in D2996 / R2996C4: got 'xx/07/2019'
Warning: Expecting date in D2997 / R2997C4: got 'xx/07/2019'
Warning: Expecting date in D2998 / R2998C4: got 'xx/07/2019'
Warning: Expecting date in D2999 / R2999C4: got 'xx/07/2019'
Warning: Expecting date in D3000 / R3000C4: got 'xx/07/2019'
Warning: Expecting date in D3001 / R3001C4: got 'xx/07/2019'
Warning: Expecting date in D3002 / R3002C4: got 'xx/07/2019'
Code
head(sp_full[[1]]) # show head of first file
# A tibble: 6 × 7
  Structure_ID Camera_ID Species         Record_date         Record_time        
  <chr>        <chr>     <chr>           <dttm>              <dttm>             
1 PSF2         CAM63309  Homo sapiens    2017-10-26 00:00:00 1899-12-31 15:05:00
2 PSF1         CAM63309  Urocyon cinere… 2017-12-13 00:00:00 1899-12-31 00:53:00
3 PSF3         CAM63309  Homo sapiens    2018-03-14 00:00:00 1899-12-31 11:04:00
4 PSF1         CAM63309  Chiroptera      2018-04-16 00:00:00 1899-12-31 20:54:00
5 PSF1         CAM63309  Chiroptera      2018-04-17 00:00:00 1899-12-31 01:24:00
6 PSF1         CAM63309  Chiroptera      2018-04-17 00:00:00 1899-12-31 02:45:00
# ℹ 2 more variables: Record_criteria <dbl>, Behavior <chr>

7.2 Specific steps

7.2.1 Full species

We start by keeping only valid (full) species. Here we consider only two-word terms whose second word is not sp, spp, ni or a similar marker commonly used to designate unidentified species.

Code
species_all_check <- sp_full |>
  purrr::map(function(x) {
    x |>
      dplyr::distinct(Species) |>  # unique values for Species
      dplyr::mutate(Species = stringr::str_squish(Species)) |> # remove whitespaces
      dplyr::filter(
        stringr::str_count(Species, " ") == 1,
        !stringr::str_detect(stringr::word(Species, 2, 2), "\\."),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^sp$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^sp(?=\\.)"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^spp(?=\\.)"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "\\("),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^ni$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^NI$"),
        !stringr::str_detect(stringr::word(Species, 2, 2), "^NID$"),
        !stringr::str_detect(Species, "\\/")
      ) |>
      dplyr::arrange(Species) |> # alphabetical order
      dplyr::pull() # vector
  })

head(species_all_check[[1]]) # show head of first list
[1] "Canis familiaris"    "Cuniculus paca"      "Dasyprocta punctata"
[4] "Eira barbara"        "Homo sapiens"        "Meleagris ocellata" 
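To see which notations the filter rejects, the sketch below applies a condensed version of the same rules to a small made-up vector. The entries in `toy` are illustrative only and do not come from the datasets; the regular expressions are merged for brevity but are meant to behave like the conditions above.

```r
# Made-up entries covering the notations the filter is meant to handle
toy <- tibble::tibble(
  Species = c(
    "Homo sapiens",                 # full species: kept
    "Didelphis sp.",                # second word contains a dot: dropped
    "Leopardus spp.",               # second word contains a dot: dropped
    "Mazama sp",                    # second word is exactly "sp": dropped
    "Chiroptera",                   # single word: dropped
    "Mazama americana/gouazoubira", # contains a slash: dropped
    "Cerdocyon  thous ",            # extra whitespace: kept after squishing
    "Rodentia NI"                   # second word is exactly "NI": dropped
  )
)

kept <- toy |>
  dplyr::mutate(Species = stringr::str_squish(Species)) |>
  dplyr::filter(
    stringr::str_count(Species, " ") == 1,
    !stringr::str_detect(stringr::word(Species, 2, 2), "\\.|\\("),
    !stringr::str_detect(stringr::word(Species, 2, 2), "^(sp|ni|NI|NID)$"),
    !stringr::str_detect(Species, "/")
  ) |>
  dplyr::pull(Species)

kept
```

Only "Homo sapiens" and "Cerdocyon thous" pass all conditions in this toy case.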

Because distinct() runs before the whitespace is squished, names that differed only in leading, trailing or repeated spaces can end up duplicated within a dataset. We therefore check each dataset for names that appear more than once.

Code
species_all_check |>
  purrr::map(function(x){
    table <- table(x)

    table[table > 1]
  }) |>
  purrr::keep(~ any(.x > 1))
named list()
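The empty result means no dataset had such duplicates. As a minimal sketch of what the check would catch, the made-up example below (the names in `raw` are illustrative) has three spellings of the same species that differ only in whitespace. It also shows that squishing before `distinct()` would remove them up front.

```r
# Made-up input: three whitespace variants of one name plus one clean name
raw <- tibble::tibble(
  Species = c("Puma concolor", "Puma concolor ", "Puma  concolor", "Eira barbara")
)

# Same order as in the workflow: distinct() first, then str_squish()
squished <- raw |>
  dplyr::distinct(Species) |> # still 4 rows, the raw strings differ
  dplyr::mutate(Species = stringr::str_squish(Species)) |>
  dplyr::pull(Species)

counts <- table(squished)
counts[counts > 1] # names that appear more than once after squishing

# Alternative order: squish first, so distinct() collapses the variants
deduplicated <- raw |>
  dplyr::mutate(Species = stringr::str_squish(Species)) |>
  dplyr::distinct(Species) |>
  dplyr::pull(Species)
```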

With the full species list in hand, we use the Global Names Verifier API (https://verifier.globalnames.org/) to check the species names. We chose to verify them against the Integrated Taxonomic Information System (ITIS), which is data source 3 in the API address (data_sources=3). We use the httr package to send the requests to the API.

For each dataset, we check all full species names. At the end of the code chunk, we unnest the columns bestResult and scoreDetails, which the Global Names Verifier returns as data frames. We then compile the results for all species of each dataset in a single data frame.

Code
list_check_globalnames <- list()

for (dataset in names(species_all_check)) {

  species <- species_all_check[[dataset]]

  message(stringr::str_glue("Starting dataset {dataset}"))

  for (sp in species) {
    sp_ <- stringr::str_replace(sp, " ", "_")

    result <- httr::GET(stringr::str_glue("https://verifier.globalnames.org/api/v1/verifications/{sp_}?data_sources=3")) # the link for the API check

    list_check_globalnames[[dataset]][[sp_]] <- jsonlite::fromJSON(rawToChar(result$content))[["names"]] # save the part that interests us on a list composed by the dataset and the species name
  }
  # bind the species list on a single data frame unnesting the columns that are a data frame
  list_check_globalnames[[dataset]][["all_results"]] <- list_check_globalnames[[dataset]] |>
    dplyr::bind_rows() |>
    tidyr::unnest(cols = c(bestResult), names_repair = "unique") |>
    tidyr::unnest(cols = c(scoreDetails), names_repair = "unique") |>
    tibble::as_tibble()
}

list_check_globalnames[[1]][["all_results"]]
# A tibble: 10 × 40
   id          name  cardinality matchType...4 dataSourceId dataSourceTitleShort
   <chr>       <chr>       <int> <chr>                <int> <chr>               
 1 1f6805ff-9… Cani…           2 Exact                    3 ITIS                
 2 3b28ba6f-d… Cuni…           2 Exact                    3 ITIS                
 3 f9c5dc3d-1… Dasy…           2 Exact                    3 ITIS                
 4 510380ef-b… Eira…           2 Exact                    3 ITIS                
 5 bc81c617-d… Homo…           2 Exact                    3 ITIS                
 6 91cac633-b… Mele…           2 Exact                    3 ITIS                
 7 51c9b515-3… Nasu…           2 Exact                    3 ITIS                
 8 b8368875-0… Pant…           2 Exact                    3 ITIS                
 9 a5261dbc-4… Puma…           2 Exact                    3 ITIS                
10 feb4010c-8… Uroc…           2 Exact                    3 ITIS                
# ℹ 34 more variables: curation...7 <chr>, recordId <chr>, outlink <chr>,
#   entryDate <chr>, sortScore <dbl>, matchedNameID <chr>, matchedName <chr>,
#   matchedCardinality <int>, matchedCanonicalSimple <chr>,
#   matchedCanonicalFull <chr>, currentRecordId <chr>, currentNameId <chr>,
#   currentName <chr>, currentCardinality <int>, currentCanonicalSimple <chr>,
#   currentCanonicalFull <chr>, taxonomicStatus <chr>, isSynonym <lgl>,
#   classificationPath <chr>, classificationRanks <chr>, …
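The loop above assumes that every request succeeds. A more defensive variant could look like the sketch below. The helpers `build_verifier_url()` and `verify_name()` are hypothetical names introduced here for illustration and are not part of the original workflow; the URL format is the one used in the loop, and the 30-second timeout is an arbitrary assumption.

```r
# Hypothetical helper: builds the same URL as the loop above
build_verifier_url <- function(name, data_source = 3) {
  query <- stringr::str_replace_all(name, " ", "_")
  as.character(stringr::str_glue(
    "https://verifier.globalnames.org/api/v1/verifications/{query}?data_sources={data_source}"
  ))
}

# Hypothetical helper: returns the "names" element, or NULL if the request fails
verify_name <- function(name, data_source = 3) {
  result <- tryCatch(
    httr::GET(build_verifier_url(name, data_source), httr::timeout(30)),
    error = function(e) NULL # e.g. no network connection or timeout
  )

  if (is.null(result) || httr::http_error(result)) {
    warning("Request failed for '", name, "'", call. = FALSE)
    return(NULL)
  }

  body <- httr::content(result, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(body)[["names"]]
}

# Usage (requires network access):
# verify_name("Eira barbara")
```

Returning NULL for failed requests keeps the later `dplyr::bind_rows()` call working, since NULL elements are ignored when binding, and the warning leaves a trace of which names need to be queried again.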

The next step consists of creating a single data frame with all the species from all the datasets. We extract the all_results element from each dataset and then stack them into one data frame.

Code
list_sp <- list_check_globalnames |>
  purrr::map("all_results") |>
  dplyr::bind_rows(.id = "dataset") |>
  janitor::clean_names() |>
  dplyr::mutate(query = stringr::str_replace_all(name, "_", " "), .after = name)

head(list_sp)
# A tibble: 6 × 42
  dataset          id        name  query cardinality match_type_4 data_source_id
  <chr>            <chr>     <chr> <chr>       <int> <chr>                 <int>
1 Alberto_Gonzalez 1f6805ff… Cani… Cani…           2 Exact                     3
2 Alberto_Gonzalez 3b28ba6f… Cuni… Cuni…           2 Exact                     3
3 Alberto_Gonzalez f9c5dc3d… Dasy… Dasy…           2 Exact                     3
4 Alberto_Gonzalez 510380ef… Eira… Eira…           2 Exact                     3
5 Alberto_Gonzalez bc81c617… Homo… Homo…           2 Exact                     3
6 Alberto_Gonzalez 91cac633… Mele… Mele…           2 Exact                     3
# ℹ 35 more variables: data_source_title_short <chr>, curation_7 <chr>,
#   record_id <chr>, outlink <chr>, entry_date <chr>, sort_score <dbl>,
#   matched_name_id <chr>, matched_name <chr>, matched_cardinality <int>,
#   matched_canonical_simple <chr>, matched_canonical_full <chr>,
#   current_record_id <chr>, current_name_id <chr>, current_name <chr>,
#   current_cardinality <int>, current_canonical_simple <chr>,
#   current_canonical_full <chr>, taxonomic_status <chr>, is_synonym <lgl>, …
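The stacking step relies on two behaviours: `purrr::map()` with a string extracts the element of that name from each list entry, and `dplyr::bind_rows(.id = ...)` stores the list names in a new column. The made-up nested list below (`toy_results` and its contents are illustrative, and `janitor::clean_names()` is left out for brevity) sketches both.

```r
# Made-up structure mimicking list_check_globalnames:
# one entry per dataset, each holding per-species results plus "all_results"
toy_results <- list(
  dataset_a = list(
    Homo_sapiens = tibble::tibble(name = "Homo_sapiens", matchType = "Exact"),
    all_results = tibble::tibble(
      name = c("Homo_sapiens", "Eira_barbara"),
      matchType = c("Exact", "Exact")
    )
  ),
  dataset_b = list(
    all_results = tibble::tibble(name = "Puma_concolor", matchType = "Exact")
  )
)

stacked <- toy_results |>
  purrr::map("all_results") |>          # keep only the "all_results" element
  dplyr::bind_rows(.id = "dataset") |>  # list names become the dataset column
  dplyr::mutate(query = stringr::str_replace_all(name, "_", " "), .after = name)

stacked
```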

Since we want only the errors, we filter the column match_type_4 to keep every row in which the result is not “Exact”. In other words, every species for which the query and the matched name are not exactly the same term is selected for further evaluation.

Code
sp_with_errors <- list_sp |>
  dplyr::filter(match_type_4 != "Exact")

head(sp_with_errors)
# A tibble: 6 × 42
  dataset           id       name  query cardinality match_type_4 data_source_id
  <chr>             <chr>    <chr> <chr>       <int> <chr>                 <int>
1 Ana_Delciellos    46334ff… Guer… Guer…           2 PartialExact              3
2 Diego_Varela      46334ff… Guer… Guer…           2 PartialExact              3
3 Diego_Varela      fa20ddb… Maza… Maza…           2 PartialExact              3
4 Diego_Varela      ca56c67… Sylv… Sylv…           2 PartialExact              3
5 EcoRioMinas_IBAMA 46334ff… Guer… Guer…           2 PartialExact              3
6 EricaSaito        c6eeb58… Dico… Dico…           2 PartialExact              3
# ℹ 35 more variables: data_source_title_short <chr>, curation_7 <chr>,
#   record_id <chr>, outlink <chr>, entry_date <chr>, sort_score <dbl>,
#   matched_name_id <chr>, matched_name <chr>, matched_cardinality <int>,
#   matched_canonical_simple <chr>, matched_canonical_full <chr>,
#   current_record_id <chr>, current_name_id <chr>, current_name <chr>,
#   current_cardinality <int>, current_canonical_simple <chr>,
#   current_canonical_full <chr>, taxonomic_status <chr>, is_synonym <lgl>, …
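Before reviewing the rows one by one, it can help to count how many names fall under each match type and to collapse names that repeat across datasets. The sketch below does this on a made-up table; `toy_list_sp` and its values are illustrative and only reuse the two match types visible in the output above.

```r
# Made-up stand-in for list_sp with only the columns needed here
toy_list_sp <- tibble::tibble(
  dataset = c("dataset_a", "dataset_a", "dataset_b", "dataset_b", "dataset_b"),
  query = c(
    "Homo sapiens", "Guerlinguetus ingrami",
    "Eira barbara", "Guerlinguetus ingrami", "Puma concolor"
  ),
  match_type_4 = c("Exact", "PartialExact", "Exact", "PartialExact", "Exact")
)

# Number of rows per match type
match_summary <- toy_list_sp |>
  dplyr::count(match_type_4, name = "n_names")

# Unique names that need manual review, regardless of dataset
to_review <- toy_list_sp |>
  dplyr::filter(match_type_4 != "Exact") |>
  dplyr::distinct(query, match_type_4)

match_summary
to_review
```

Note that `dplyr::filter()` drops rows where the condition evaluates to NA, so a name with a missing match type would be excluded by both the `== "Exact"` and the `!= "Exact"` filter.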

After checking the list of species flagged as not “Exact”, we found that some of them must be whitelisted, since we are sure that the name is valid (for example, by checking it against the List of Brazilian Mammals from the Brazilian Mastozoological Society). These names can be appended to the names that were considered “Exact”. This is the last step of the “Full species” inspection.

Code
sp_whitelist <- list_sp |>
  dplyr::filter(match_type_4 == "Exact") |>
  dplyr::pull(query) |>
  append(c("Guerlinguetus brasiliensis", "Guerlinguetus ingrami")) |> # manually add species that we know are correct but that the API does not match exactly
  unique() |>
  sort()

head(sp_whitelist)
[1] "Alouatta guariba"        "Alouatta macconnelli"   
[3] "Amazonetta brasiliensis" "Ameiva ameiva"          
[5] "Aotus nigriceps"         "Aphelocoma wollweberi"  
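As a small sketch of how the whitelist is meant to be used, the example below keeps the manual additions in their own vector, which makes them easier to audit, and then filters a set of queries against the combined list. All object names and the contents of `exact_names` and `queries` are made up for illustration.

```r
# Names the API matched exactly (made-up subset, with a repeated entry)
exact_names <- c("Eira barbara", "Alouatta guariba", "Eira barbara")

# Names validated by hand, kept separate so they can be reviewed later
manual_whitelist <- c("Guerlinguetus brasiliensis", "Guerlinguetus ingrami")

# union() drops the repeated entry; sort() gives alphabetical order
toy_whitelist <- sort(union(exact_names, manual_whitelist))

# Queries not covered by the whitelist still need attention
queries <- c("Guerlinguetus ingrami", "Mazama sp")
still_flagged <- queries[!queries %in% toy_whitelist]

still_flagged
```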

7.2.2 Imprecise taxa

First, we have to select the terms that were not included in the query for full species: every term that was not considered a full species still has to be evaluated.

Code
non_species_all_check <- purrr::map2(sp_full, species_all_check, function(x, y){
  x |>
    dplyr::distinct(Species) |>
    dplyr::mutate(Species = stringr::str_squish(Species)) |>
    dplyr::pull(Species) |>
    setdiff(y)
}) |>
  purrr::compact()

head(non_species_all_check[[1]])
[1] "Chiroptera"    "Passeriformes" "Didelphis sp." "Momotus sp."  

We follow the same approach as for the full species, this time for the terms that are not full species. At the end, we create a data frame with the API results for all of these terms, including queries that involve markers such as “NI” or “spp”. Note that this chunk does not filter on match_type, so terms matched as “Exact” are also kept.

Code
list_non_sp_with_errors <- list()

for (dataset in names(non_species_all_check)) {

  species <- non_species_all_check[[dataset]]

  message(stringr::str_glue("Starting dataset {dataset}"))

  for (sp in species) {
    sp_ <- sp |>
      stringr::str_remove_all("[[:punct:]]") |>
      stringr::str_replace_all(pattern = " ", replacement = "_")

    result <- httr::GET(stringr::str_glue("https://verifier.globalnames.org/api/v1/verifications/{sp_}?data_sources=3")) # the link for the API check

    list_non_sp_with_errors[[dataset]][[sp_]] <- jsonlite::fromJSON(rawToChar(result$content))[["names"]] # save the part that interests us on a list composed by the dataset and the species name
  }
  # bind the species list on a single data frame unnesting the columns that are a data frame
  list_non_sp_with_errors[[dataset]][["all_results"]] <- list_non_sp_with_errors[[dataset]] |>
    dplyr::bind_rows() |>
    tibble::as_tibble()
}

non_sp_with_errors <- list_non_sp_with_errors |>
  purrr::map("all_results") |>
  dplyr::bind_rows(.id = "dataset") |>
  janitor::clean_names() |>
  dplyr::mutate(query = stringr::str_replace_all(name, "_", " "), .after = name)

head(non_sp_with_errors)
# A tibble: 6 × 10
  dataset        id    name  query cardinality match_type best_result$dataSour…¹
  <chr>          <chr> <chr> <chr>       <int> <chr>                       <int>
1 Alberto_Gonza… af14… Chir… Chir…           1 Exact                           3
2 Alberto_Gonza… 22da… Pass… Pass…           1 Exact                           3
3 Alberto_Gonza… 79ee… Dide… Dide…           0 Exact                           3
4 Alberto_Gonza… 04e6… Momo… Momo…           0 Exact                           3
5 Ana_Delciellos 9fd2… Mamm… Mamm…           1 Exact                           3
6 Ana_Delciellos 2806… Aves  Aves            1 Exact                           3
# ℹ abbreviated name: ¹​best_result$dataSourceId
# ℹ 29 more variables: best_result$dataSourceTitleShort <chr>, $curation <chr>,
#   $recordId <chr>, $outlink <chr>, $entryDate <chr>, $sortScore <dbl>,
#   $matchedNameID <chr>, $matchedName <chr>, $matchedCardinality <int>,
#   $matchedCanonicalSimple <chr>, $matchedCanonicalFull <chr>,
#   $currentRecordId <chr>, $currentNameId <chr>, $currentName <chr>,
#   $currentCardinality <int>, $currentCanonicalSimple <chr>, …

The last step is to put together a full list of problems/errors, regardless of whether they come from full species or imprecise taxa. In this step we use sp_whitelist to exclude the terms that we consider correct.

Code
sp_with_errors |>
  dplyr::select(dataset, query, matched_canonical_simple, match_type = match_type_4) |>
  dplyr::filter(!query %in% sp_whitelist) |>
  dplyr::bind_rows(non_sp_with_errors) |>
  dplyr::arrange(dataset)
# A tibble: 221 × 11
   dataset       query matched_canonical_si…¹ match_type id    name  cardinality
   <chr>         <chr> <chr>                  <chr>      <chr> <chr>       <int>
 1 Alberto_Gonz… Chir… <NA>                   Exact      af14… Chir…           1
 2 Alberto_Gonz… Pass… <NA>                   Exact      22da… Pass…           1
 3 Alberto_Gonz… Dide… <NA>                   Exact      79ee… Dide…           0
 4 Alberto_Gonz… Momo… <NA>                   Exact      04e6… Momo…           0
 5 Ana_Delciell… Mamm… <NA>                   Exact      9fd2… Mamm…           1
 6 Ana_Delciell… Aves  <NA>                   Exact      2806… Aves            1
 7 Ana_Delciell… Leop… <NA>                   Exact      12ee… Leop…           0
 8 Ana_Delciell… Dasy… <NA>                   Exact      1f77… Dasy…           0
 9 Arteris_Lito… Aram… <NA>                   Exact      2c8c… Aram…           0
10 Arteris_Lito… Rode… <NA>                   Exact      d91e… Rode…           1
# ℹ 211 more rows
# ℹ abbreviated name: ¹​matched_canonical_simple
# ℹ 4 more variables: best_result <df[,27]>, data_sources_num <int>,
#   data_sources_ids <list>, curation <chr>
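In the output above, matched_canonical_simple is NA for the imprecise taxa because, for those rows, the matched name is still packed inside the best_result data-frame column (which was not unnested in the second loop). One possible way to line the columns up before binding is `tidyr::unpack()`, sketched below on a made-up table. `toy_nested` and its values are illustrative; only the inner column names follow the output shown earlier.

```r
# Made-up stand-in for non_sp_with_errors: best_result is a data-frame column
toy_nested <- tibble::tibble(
  dataset = c("dataset_a", "dataset_a"),
  query = c("Chiroptera", "Didelphis sp"),
  match_type = c("Exact", "Exact"),
  best_result = tibble::tibble(
    dataSourceId = c(3L, 3L),
    matchedCanonicalSimple = c("Chiroptera", "Didelphis")
  )
)

flat <- toy_nested |>
  tidyr::unpack(best_result) |> # inner columns become top-level columns
  dplyr::rename(matched_canonical_simple = matchedCanonicalSimple)

flat
```

After this step the imprecise taxa would carry the same matched_canonical_simple column as the full species, so `dplyr::bind_rows()` would fill it for both groups instead of leaving NA.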